Detecting Misspelled Words in Turkish Text Using Syllable n-gram Frequencies

Internal identifier: 000E73 (Main/Exploration); previous: 000E72; next: 000E74

Authors: Rifat A Liyan [Turkey]; Korhan Günel [Turkey]; Tatyana Yakhno [Turkey]

Source:

Lecture Notes in Computer Science [0302-9743]; 2007.

RBID : ISTEX:4EB20517731C5FBD417830DB1BCAA19F7A3C1C06

Abstract

In this study, we designed and implemented a system that decides whether a word in Turkish text is misspelled. First, three databases of syllable monogram, bigram and trigram frequencies are constructed from the syllables of five different Turkish corpora. The system then takes the words of a Turkish text as input and computes the probability of each word from the syllable monogram, bigram and trigram frequencies in the databases. If the probability of a word is zero, the word is judged to be misspelled. To test the system, we constructed two text databases from the same set of words: one contains 685 misspelled words and the other contains 685 correctly spelled words. Each word from these databases is given to the system, which returns one of two results: “Correctly spelled word” or “Misspelled word”. The system built on monogram and bigram frequencies achieves an 86% success rate on the misspelled words and 88% on the correctly spelled words. The system built on bigram and trigram frequencies achieves a 97% success rate on the misspelled words and 98% on the correctly spelled words.
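
The zero-probability decision rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that words arrive already split into syllables (the syllabification step is not shown), that a word's probability is an unsmoothed product of conditional syllable n-gram frequencies, and that word boundaries are marked with the hypothetical symbols `<s>` and `</s>`. The function names and the toy corpus are illustrative only.

```python
from collections import Counter


def ngrams(syllables, n):
    """Return all n-grams (as tuples) over a sequence of syllables."""
    return [tuple(syllables[i:i + n]) for i in range(len(syllables) - n + 1)]


def pad(syllables):
    """Add boundary markers (an assumption of this sketch, not from the source)."""
    return ["<s>"] + list(syllables) + ["</s>"]


def build_counts(corpus, max_n=3):
    """Count syllable monograms, bigrams and trigrams over syllabified words."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for syllables in corpus:
        padded = pad(syllables)
        for n in counts:
            counts[n].update(ngrams(padded, n))
    return counts


def word_probability(syllables, counts, n=2):
    """Product of count(n-gram) / count(its (n-1)-gram context), unsmoothed.

    n=2 uses monogram and bigram counts; n=3 uses bigram and trigram counts.
    """
    if n not in (2, 3):
        raise ValueError("this sketch supports n=2 or n=3 only")
    prob = 1.0
    for gram in ngrams(pad(syllables), n):
        context_count = counts[n - 1][gram[:-1]]
        gram_count = counts[n][gram]
        if context_count == 0 or gram_count == 0:
            return 0.0  # an unseen syllable sequence makes the whole product zero
        prob *= gram_count / context_count
    return prob


def is_misspelled(syllables, counts, n=2):
    """Flag a word as misspelled when its estimated probability is zero."""
    return word_probability(syllables, counts, n) == 0.0


# Toy corpus of hand-syllabified words (illustrative data only).
toy_corpus = [["ki", "tap"], ["ka", "lem"], ["ki", "tap", "lar"]]
toy_counts = build_counts(toy_corpus)
print(is_misspelled(["ki", "tap"], toy_counts, n=2))
print(is_misspelled(["ki", "tpa"], toy_counts, n=2))
```

On this toy corpus the first call reports the word as correctly spelled and the second as misspelled, because the pair ("ki", "tpa") never occurs in the counts. Since nothing is smoothed, any correctly spelled word whose syllable sequence is absent from the training data would also be flagged, so coverage of the frequency databases matters.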

URL:

https://api.istex.fr/document/4EB20517731C5FBD417830DB1BCAA19F7A3C1C06/fulltext/pdf

DOI: 10.1007/978-3-540-77046-6_68

Affiliations:

Turkey

Links toward previous steps (curation, corpus...)

to stream Istex, to step Corpus: 000B06
to stream Istex, to step Curation: 000A93
to stream Istex, to step Checkpoint: 000888
to stream Main, to step Merge: 000E86
to stream Main, to step Curation: 000E73

The document in XML format

<record><TEI wicri:istexFullTextTei="biblStruct"><teiHeader><fileDesc><titleStmt><title xml:lang="en">Detecting Misspelled Words in Turkish Text Using Syllable n -gram Frequencies</title>
<author><name sortKey="A Liyan, Rifat" sort="A Liyan, Rifat" uniqKey="A Liyan R" first="Rifat" last="A Liyan">Rifat A Liyan</name>
</author>
<author><name sortKey="Gunel, Korhan" sort="Gunel, Korhan" uniqKey="Gunel K" first="Korhan" last="Günel">Korhan Günel</name>
</author>
<author><name sortKey="Yakhno, Tatyana" sort="Yakhno, Tatyana" uniqKey="Yakhno T" first="Tatyana" last="Yakhno">Tatyana Yakhno</name>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">ISTEX</idno>
<idno type="RBID">ISTEX:4EB20517731C5FBD417830DB1BCAA19F7A3C1C06</idno>
<date when="2007" year="2007">2007</date>
<idno type="doi">10.1007/978-3-540-77046-6_68</idno>
<idno type="url">https://api.istex.fr/document/4EB20517731C5FBD417830DB1BCAA19F7A3C1C06/fulltext/pdf</idno>
<idno type="wicri:Area/Istex/Corpus">000B06</idno>
<idno type="wicri:Area/Istex/Curation">000A93</idno>
<idno type="wicri:Area/Istex/Checkpoint">000888</idno>
<idno type="wicri:doubleKey">0302-9743:2007:A Liyan R:detecting:misspelled:words</idno>
<idno type="wicri:Area/Main/Merge">000E86</idno>
<idno type="wicri:Area/Main/Curation">000E73</idno>
<idno type="wicri:Area/Main/Exploration">000E73</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title level="a" type="main" xml:lang="en">Detecting Misspelled Words in Turkish Text Using Syllable n -gram Frequencies</title>
<author><name sortKey="A Liyan, Rifat" sort="A Liyan, Rifat" uniqKey="A Liyan R" first="Rifat" last="A Liyan">Rifat A Liyan</name>
<affiliation wicri:level="1"><country xml:lang="fr">Turquie</country>
<wicri:regionArea>Dokuz Eylül University, İzmir</wicri:regionArea>
<wicri:noRegion>İzmir</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Gunel, Korhan" sort="Gunel, Korhan" uniqKey="Gunel K" first="Korhan" last="Günel">Korhan Günel</name>
<affiliation wicri:level="1"><country xml:lang="fr">Turquie</country>
<wicri:regionArea>Dokuz Eylül University, İzmir</wicri:regionArea>
<wicri:noRegion>İzmir</wicri:noRegion>
</affiliation>
</author>
<author><name sortKey="Yakhno, Tatyana" sort="Yakhno, Tatyana" uniqKey="Yakhno T" first="Tatyana" last="Yakhno">Tatyana Yakhno</name>
<affiliation wicri:level="1"><country xml:lang="fr">Turquie</country>
<wicri:regionArea>Dokuz Eylül University, İzmir</wicri:regionArea>
<wicri:noRegion>İzmir</wicri:noRegion>
</affiliation>
</author>
</analytic>
<monogr></monogr>
<series><title level="s">Lecture Notes in Computer Science</title>
<imprint><date>2007</date>
</imprint>
<idno type="ISSN">0302-9743</idno>
<idno type="eISSN">1611-3349</idno>
<idno type="ISSN">0302-9743</idno>
</series>
<idno type="istex">4EB20517731C5FBD417830DB1BCAA19F7A3C1C06</idno>
<idno type="DOI">10.1007/978-3-540-77046-6_68</idno>
<idno type="ChapterID">68</idno>
<idno type="ChapterID">Chap68</idno>
</biblStruct>
</sourceDesc>
<seriesStmt><idno type="ISSN">0302-9743</idno>
</seriesStmt>
</fileDesc>
<profileDesc><textClass></textClass>
<langUsage><language ident="en">en</language>
</langUsage>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en">Abstract: In this study, we have designed and implemented a system which decides whether or not a word is misspelled in Turkish text. Firstly, three databases of syllable monogram, bigram and trigram frequencies are constructed using the syllables that are derived from five different Turkish corpora. Then, the system takes words in Turkish text as an input and computes the probability distribution of words using syllable monogram, bigram and trigram frequencies from the databases. If the probability distribution of a word is zero, it is decided that this word is misspelled. For testing the system, we have constructed two text databases with the same words. One text database has 685 misspelled words. The other has 685 correctly spelled words. The words from these text databases are taken as inputs for the system. The system produces two results for each word: “Correctly spelled word” or “Misspelled word”. The system that is designed with monogram and bigram frequencies has 86% success rate for the misspelled words and has 88% success rate for the correctly spelled words. According to the system designed with bigram and trigram frequencies, there is 97% success rate for the misspelled words and there is 98% success rate for the correctly spelled words.</div>
</front>
</TEI>
<affiliations><list><country><li>Turquie</li>
</country>
</list>
<tree><country name="Turquie"><noRegion><name sortKey="A Liyan, Rifat" sort="A Liyan, Rifat" uniqKey="A Liyan R" first="Rifat" last="A Liyan">Rifat A Liyan</name>
</noRegion>
<name sortKey="Gunel, Korhan" sort="Gunel, Korhan" uniqKey="Gunel K" first="Korhan" last="Günel">Korhan Günel</name>
<name sortKey="Yakhno, Tatyana" sort="Yakhno, Tatyana" uniqKey="Yakhno T" first="Tatyana" last="Yakhno">Tatyana Yakhno</name>
</country>
</tree>
</affiliations>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Ticri/CIDE/explor/OcrV1/Data/Main/Exploration

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000E73 | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Main/Exploration/biblio.hfd -nk 000E73 | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Ticri/CIDE
   |area=    OcrV1
   |flux=    Main
   |étape=   Exploration
   |type=    RBID
   |clé=     ISTEX:4EB20517731C5FBD417830DB1BCAA19F7A3C1C06
   |texte=   Detecting Misspelled Words in Turkish Text Using Syllable n -gram Frequencies
}}

This area was generated with Dilib version V0.6.32.
Data generation: Sat Nov 11 16:53:45 2017. Site generation: Mon Mar 11 23:15:16 2024

OCR exploration server
Warning: this site is under development. It was generated automatically from raw corpora, so the information has not been validated.
